1 Introduction

In this practical we study intrusion detection, which can be modeled as a binary classification problem: the goal is to determine whether network traffic shows anomalous behavior or not.

To achieve this goal we implement several prediction models based on artificial intelligence and finally choose the one with the lowest error (while also avoiding overfitting).

library(class)
library(dplyr)
library(ggplot2)
library(rpart)
library(rpart.plot)
library(caret)

2 The Data

The dataset to be audited consists of a wide variety of intrusions simulated in a military network environment. An environment was created to acquire raw TCP/IP dump data for a network simulating a typical US Air Force LAN. The LAN was operated as if it were a real environment and was subjected to multiple attacks.

A connection is a sequence of TCP packets starting and ending at some point in time, between which data flows from a source IP address to a destination IP address under some well-defined protocol. Each connection is labeled either as normal or as an attack, with exactly one specific attack type. Each connection record consists of about 100 bytes.

For each TCP/IP connection, 41 quantitative and qualitative features are obtained from the normal and attack data (3 qualitative and 38 quantitative). The class variable has two categories: normal and anomalous (labeled normal and anomaly in the data).

Train_data <- read.csv("./datos/Train_data.csv")

Test_data <-  read.csv("./datos/Test_data.csv")
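As a quick check on the stated split between qualitative and quantitative features, the column types can be counted once the data is loaded. The sketch below uses a small hypothetical data frame (toy, with made-up values and only a few of the columns) as a stand-in for Train_data; on the real data the same expressions would be applied to Train_data itself.

```r
# Hypothetical stand-in for Train_data: invented values, only a few columns.
toy <- data.frame(
  duration      = c(0, 2, 0),
  protocol_type = c("tcp", "udp", "tcp"),
  service       = c("http", "domain_u", "private"),
  flag          = c("SF", "SF", "S0"),
  src_bytes     = c(232, 45, 0),
  class         = c("normal", "normal", "anomaly"),
  stringsAsFactors = FALSE
)

# Drop the class label and classify each remaining column by its type.
features <- toy[, setdiff(names(toy), "class")]
is_qualitative <- vapply(
  features,
  function(col) is.character(col) || is.factor(col),
  logical(1)
)

n_qualitative  <- sum(is_qualitative)
n_quantitative <- sum(!is_qualitative)
cat("qualitative:", n_qualitative, "quantitative:", n_quantitative, "\n")
```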

Let's look at the first rows of the data:

data <- Train_data
data

3 EDA

First, we examine the data recorded for each event statistically to assess its possible relevance for building a prediction model. From now on these data fields are referred to as variables:

Analysis of duration. Relevant.

p <- data %>% mutate(log10_duration = log10(duration+0.5)) %>%
  select(class, log10_duration) %>%
  na.omit() %>%
  ggplot(aes(x=log10_duration, colour=class)) +
    geom_density(lwd=2)
p

p2 <- ggplot(data = data, aes(x = class, y = log10(duration+0.5), color = class)) +
      geom_boxplot(outlier.shape = NA) +
      geom_jitter(alpha = 0.3, width = 0.15) +
      scale_color_manual(values = c("gray50", "orangered2")) +
      theme_bw()

p2
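The density and box plots can be complemented with a numeric summary per class. The following is a minimal sketch on a hypothetical toy data frame (the values are invented for illustration). It also shows why 0.5 is added before taking the logarithm: a value of 0 would otherwise give -Inf.

```r
# Hypothetical toy data: invented durations, four rows per class.
toy <- data.frame(
  duration = c(0, 0, 5, 120, 0, 3000, 0, 40),
  class    = rep(c("normal", "anomaly"), each = 4)
)

# Same transformation as in the plots: the 0.5 offset keeps zeros finite.
toy$log10_duration <- log10(toy$duration + 0.5)

# Median and interquartile range of the transformed variable per class.
medians <- tapply(toy$log10_duration, toy$class, median)
iqrs    <- tapply(toy$log10_duration, toy$class, IQR)
print(rbind(median = medians, IQR = iqrs))
```

A clear difference between the per-class medians relative to the spread would support the visual impression that a variable is relevant.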

Analysis by protocol_type. Relevant.

ggplot(data, aes(protocol_type)) + geom_bar()

ggplot(data, aes(protocol_type,fill=class))+ geom_bar()
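For a categorical variable such as protocol_type, the stacked bars can be read more precisely as class proportions within each category. A minimal sketch, assuming a toy data frame with invented values in place of the real data:

```r
# Hypothetical toy data: invented protocol/class combinations.
toy <- data.frame(
  protocol_type = c("tcp", "tcp", "tcp", "udp", "udp", "icmp"),
  class         = c("normal", "anomaly", "anomaly", "normal", "normal", "anomaly")
)

# Contingency table of protocol by class, then the class share within each protocol.
counts    <- table(toy$protocol_type, toy$class)
row_props <- prop.table(counts, margin = 1)
print(round(row_props, 3))
```

If the proportions differ strongly between categories, the variable carries information about the class.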

Analysis of service. Relevant.

# service is categorical, so a kernel density estimate is not defined for it;
# plot the class share within each service instead.
p <- data %>%
  select(class, service) %>%
  na.omit() %>%
  ggplot(aes(x=service, fill=class)) +
    geom_bar(position="fill") +
    coord_flip()
p

ggplot(data, aes(service)) + geom_bar() + coord_flip()

ggplot(data, aes(service,fill=class))+ geom_bar()

Analysis of flag. Relevant.

# flag is categorical, so a kernel density estimate is not defined for it;
# plot the class share within each flag value instead.
p <- data %>%
  select(class, flag) %>%
  na.omit() %>%
  ggplot(aes(x=flag, fill=class)) +
    geom_bar(position="fill")
p

ggplot(data, aes(flag)) + geom_bar()

ggplot(data, aes(flag,fill=class))+ geom_bar()

Analysis of src_bytes. Relevant.

p <- data %>% mutate(log10_src_bytes = log10(src_bytes+0.5)) %>%
  select(class, log10_src_bytes) %>%
  na.omit() %>%
  ggplot(aes(x=log10_src_bytes, colour=class)) +
    geom_density(lwd=2)
p

p2 <- ggplot(data = data, aes(x = class, y = log10(src_bytes+0.5), color = class)) +
      geom_boxplot(outlier.shape = NA) +
      geom_jitter(alpha = 0.3, width = 0.15) +
      scale_color_manual(values = c("gray50", "orangered2")) +
      theme_bw()

p2

Analysis of dst_bytes. Relevant.

p <- data %>% mutate(log10_dst_bytes = log10(dst_bytes+0.5)) %>%
  select(class, log10_dst_bytes) %>%
  na.omit() %>%
  ggplot(aes(x=log10_dst_bytes, colour=class)) +
    geom_density(lwd=2)
p

p2 <- ggplot(data = data, aes(x = class, y = log10(dst_bytes+0.5), color = class)) +
      geom_boxplot(outlier.shape = NA) +
      geom_jitter(alpha = 0.3, width = 0.15) +
      scale_color_manual(values = c("gray50", "orangered2")) +
      theme_bw()

p2

Analysis of land. Irrelevant.

p <- data %>% mutate(log10_land = log10(land+0.5)) %>%
  select(class, log10_land) %>%
  na.omit() %>%
  ggplot(aes(x=log10_land, colour=class)) +
    geom_density(lwd=2)
p

p2 <- ggplot(data = data, aes(x = class, y = log10(land+0.5), color = class)) +
      geom_boxplot(outlier.shape = NA) +
      geom_jitter(alpha = 0.3, width = 0.15) +
      scale_color_manual(values = c("gray50", "orangered2")) +
      theme_bw()

p2

Analysis of wrong_fragment. Could be relevant.

p <- data %>% mutate(log10_wrong_fragment = log10(wrong_fragment+0.5)) %>%
  select(class, log10_wrong_fragment) %>%
  na.omit() %>%
  ggplot(aes(x=log10_wrong_fragment, colour=class)) +
    geom_density(lwd=2)
p

p2 <- ggplot(data = data, aes(x = class, y = log10(wrong_fragment+0.5), color = class)) +
      geom_boxplot(outlier.shape = NA) +
      geom_jitter(alpha = 0.3, width = 0.15) +
      scale_color_manual(values = c("gray50", "orangered2")) +
      theme_bw()

p2

Analysis of urgent. Irrelevant.

p <- data %>%
  select(class, urgent) %>%
  na.omit() %>%
  ggplot(aes(x=urgent, colour=class)) +
    geom_density(lwd=2)
p

p2 <- ggplot(data = data, aes(x = class, y = urgent, color = class)) +
      geom_boxplot(outlier.shape = NA) +
      geom_jitter(alpha = 0.3, width = 0.15) +
      scale_color_manual(values = c("gray50", "orangered2")) +
      theme_bw()

p2

Analysis of hot. Irrelevant.

p <- data %>% mutate(log10_hot = log10(hot+0.5)) %>%
  select(class, log10_hot) %>%
  na.omit() %>%
  ggplot(aes(x=log10_hot, colour=class)) +
    geom_density(lwd=2)
p

p2 <- ggplot(data = data, aes(x = class, y = log10(hot+0.5), color = class)) +
      geom_boxplot(outlier.shape = NA) +
      geom_jitter(alpha = 0.3, width = 0.15) +
      scale_color_manual(values = c("gray50", "orangered2")) +
      theme_bw()

p2

Analysis of num_failed_logins. Irrelevant.

p <- data %>% mutate(log10_num_failed_logins = log10(num_failed_logins+0.5)) %>%
  select(class, log10_num_failed_logins) %>%
  na.omit() %>%
  ggplot(aes(x=log10_num_failed_logins, colour=class)) +
    geom_density(lwd=2)
p

p2 <- ggplot(data = data, aes(x = class, y = log10(num_failed_logins+0.5), color = class)) +
      geom_boxplot(outlier.shape = NA) +
      geom_jitter(alpha = 0.3, width = 0.15) +
      scale_color_manual(values = c("gray50", "orangered2")) +
      theme_bw()

p2

Analysis of logged_in. Relevant.

p <- data %>% 
  select(class, logged_in) %>%
  na.omit() %>%
  ggplot(aes(x=logged_in, colour=class)) +
    geom_density(lwd=2)
p

p2 <- ggplot(data = data, aes(x = class, y = logged_in, color = class)) +
      geom_boxplot(outlier.shape = NA) +
      geom_jitter(alpha = 0.3, width = 0.15) +
      scale_color_manual(values = c("gray50", "orangered2")) +
      theme_bw()

p2

Analysis of num_compromised. Irrelevant.

p <- data %>% mutate(log10_num_compromised = log10(num_compromised+0.5)) %>%
  select(class, log10_num_compromised) %>%
  na.omit() %>%
  ggplot(aes(x=log10_num_compromised, colour=class)) +
    geom_density(lwd=2)
p

p2 <- ggplot(data = data, aes(x = class, y = log10(num_compromised+0.5), color = class)) +
      geom_boxplot(outlier.shape = NA) +
      geom_jitter(alpha = 0.3, width = 0.15) +
      scale_color_manual(values = c("gray50", "orangered2")) +
      theme_bw()

p2

Analysis of root_shell. Irrelevant.

# root_shell is 1 when a root shell was obtained and 0 otherwise.
datos <- data %>% mutate(root_shell_2 = case_when(root_shell == 0 ~ "No",
root_shell == 1 ~ "Yes"))
datos$root_shell_2 <- as.factor(datos$root_shell_2)

# Both variables are categorical, so bar charts replace the density and box plots.
p1 <- ggplot(data = datos, aes(x = root_shell_2, fill = class)) +
      geom_bar(position = "dodge") +
      scale_fill_manual(values = c("gray50", "orangered2")) +
      theme_bw()
p2 <- ggplot(data = datos, aes(x = class, fill = root_shell_2)) +
      geom_bar(position = "fill") +
      theme_bw()
p1

p2

Analysis of su_attempted. Irrelevant.

p <- data  %>%
  select(class, su_attempted) %>%
  na.omit() %>%
  ggplot(aes(x=su_attempted, colour=class)) +
    geom_density(lwd=2)
p

ggplot(data, aes(su_attempted)) + geom_bar()

ggplot(data, aes(su_attempted,fill=class))+ geom_bar()

Analysis of num_root. Irrelevant.

p <- data %>% mutate(log10_num_root = log10(num_root+0.5)) %>%
  select(class, log10_num_root) %>%
  na.omit() %>%
  ggplot(aes(x=log10_num_root, colour=class)) +
    geom_density(lwd=2)
p

p2 <- ggplot(data = data, aes(x = class, y = log10(num_root+0.5), color = class)) +
      geom_boxplot(outlier.shape = NA) +
      geom_jitter(alpha = 0.3, width = 0.15) +
      scale_color_manual(values = c("gray50", "orangered2")) +
      theme_bw()

p2

Analysis of num_file_creations. Irrelevant.

p <- data %>% mutate(log10_num_file_creations = log10(num_file_creations+0.5)) %>%
  select(class, log10_num_file_creations) %>%
  na.omit() %>%
  ggplot(aes(x=log10_num_file_creations, colour=class)) +
    geom_density(lwd=2)
p

p2 <- ggplot(data = data, aes(x = class, y = log10(num_file_creations+0.5), color = class)) +
      geom_boxplot(outlier.shape = NA) +
      geom_jitter(alpha = 0.3, width = 0.15) +
      scale_color_manual(values = c("gray50", "orangered2")) +
      theme_bw()

p2

Analysis of num_shells. Irrelevant.

p <- data %>%
  select(class, num_shells) %>%
  na.omit() %>%
  ggplot(aes(x=num_shells, colour=class)) +
    geom_density(lwd=2)
p

p2 <- ggplot(data = data, aes(x = class, y = num_shells, color = class)) +
      geom_boxplot(outlier.shape = NA) +
      geom_jitter(alpha = 0.3, width = 0.15) +
      scale_color_manual(values = c("gray50", "orangered2")) +
      theme_bw()

p2

Analysis of num_access_files. Irrelevant.

p <- data %>%
  select(class, num_access_files) %>%
  na.omit() %>%
  ggplot(aes(x=num_access_files, colour=class)) +
    geom_density(lwd=2)
p

p2 <- ggplot(data = data, aes(x = class, y = num_access_files, color = class)) +
      geom_boxplot(outlier.shape = NA) +
      geom_jitter(alpha = 0.3, width = 0.15) +
      scale_color_manual(values = c("gray50", "orangered2")) +
      theme_bw()

p2

Analysis of num_outbound_cmds. Irrelevant.

p <- data  %>%
  select(class, num_outbound_cmds) %>%
  na.omit() %>%
  ggplot(aes(x=num_outbound_cmds, colour=class)) +
    geom_density(lwd=2)
p

p2 <- ggplot(data = data, aes(x = class, y = num_outbound_cmds, color = class)) +
      geom_boxplot(outlier.shape = NA) +
      geom_jitter(alpha = 0.3, width = 0.15) +
      scale_color_manual(values = c("gray50", "orangered2")) +
      theme_bw()

p2

Analysis of is_host_login. Irrelevant.

p <- data  %>%
  select(class, is_host_login) %>%
  na.omit() %>%
  ggplot(aes(x=is_host_login, colour=class)) +
    geom_density(lwd=2)
p

p2 <- ggplot(data = data, aes(x = class, y = is_host_login, color = class)) +
      geom_boxplot(outlier.shape = NA) +
      geom_jitter(alpha = 0.3, width = 0.15) +
      scale_color_manual(values = c("gray50", "orangered2")) +
      theme_bw()

p2

Analysis of is_guest_login. Irrelevant.

p <- data  %>%
  select(class, is_guest_login) %>%
  na.omit() %>%
  ggplot(aes(x=is_guest_login, colour=class)) +
    geom_density(lwd=2)
p

p2 <- ggplot(data = data, aes(x = class, y = is_guest_login, color = class)) +
      geom_boxplot(outlier.shape = NA) +
      geom_jitter(alpha = 0.3, width = 0.15) +
      scale_color_manual(values = c("gray50", "orangered2")) +
      theme_bw()

p2

Analysis of count. Relevant.

p <- data %>% mutate(log10_count = log10(count+0.5)) %>%
  select(class, log10_count) %>%
  na.omit() %>%
  ggplot(aes(x=log10_count, colour=class)) +
    geom_density(lwd=2)
p

p2 <- ggplot(data = data, aes(x = class, y = log10(count+0.5), color = class)) +
      geom_boxplot(outlier.shape = NA) +
      geom_jitter(alpha = 0.3, width = 0.15) +
      scale_color_manual(values = c("gray50", "orangered2")) +
      theme_bw()

p2

Analysis of srv_count. Appears relevant.

p <- data %>% mutate(log10_srv_count = log10(srv_count+0.5)) %>%
  select(class, log10_srv_count) %>%
  na.omit() %>%
  ggplot(aes(x=log10_srv_count, colour=class)) +
    geom_density(lwd=2)
p

p2 <- ggplot(data = data, aes(x = class, y = log10(srv_count+0.5), color = class)) +
      geom_boxplot(outlier.shape = NA) +
      geom_jitter(alpha = 0.3, width = 0.15) +
      scale_color_manual(values = c("gray50", "orangered2")) +
      theme_bw()

p2

Analysis of serror_rate. Relevant.

p <- data %>%
  select(class, serror_rate) %>%
  na.omit() %>%
  ggplot(aes(x=serror_rate, colour=class)) +
    geom_density(lwd=2)
p

p2 <- ggplot(data = data, aes(x = class, y = serror_rate, color = class)) +
      geom_boxplot(outlier.shape = NA) +
      geom_jitter(alpha = 0.3, width = 0.15) +
      scale_color_manual(values = c("gray50", "orangered2")) +
      theme_bw()

p2

Analysis of srv_serror_rate. Relevant.

p <- data  %>%
  select(class, srv_serror_rate) %>%
  na.omit() %>%
  ggplot(aes(x=srv_serror_rate, colour=class)) +
    geom_density(lwd=2)
p

p2 <- ggplot(data = data, aes(x = class, y = srv_serror_rate, color = class)) +
      geom_boxplot(outlier.shape = NA) +
      geom_jitter(alpha = 0.3, width = 0.15) +
      scale_color_manual(values = c("gray50", "orangered2")) +
      theme_bw()

p2

Analysis of rerror_rate. Appears relevant.

p <- data %>% mutate(log10_rerror_rate = log10(rerror_rate+0.5)) %>%
  select(class, log10_rerror_rate) %>%
  na.omit() %>%
  ggplot(aes(x=log10_rerror_rate, colour=class)) +
    geom_density(lwd=2)
p

p2 <- ggplot(data = data, aes(x = class, y = log10(rerror_rate+0.5), color = class)) +
      geom_boxplot(outlier.shape = NA) +
      geom_jitter(alpha = 0.3, width = 0.15) +
      scale_color_manual(values = c("gray50", "orangered2")) +
      theme_bw()

p2

Analysis of srv_rerror_rate. Could be relevant.

p <- data %>%
  select(class, srv_rerror_rate) %>%
  na.omit() %>%
  ggplot(aes(x=srv_rerror_rate, colour=class)) +
    geom_density(lwd=2)
p

p2 <- ggplot(data = data, aes(x = class, y = srv_rerror_rate, color = class)) +
      geom_boxplot(outlier.shape = NA) +
      geom_jitter(alpha = 0.3, width = 0.15) +
      scale_color_manual(values = c("gray50", "orangered2")) +
      theme_bw()

p2

Analysis of same_srv_rate. Relevant.

p <- data %>% mutate(log10_same_srv_rate = log10(same_srv_rate+0.5)) %>%
  select(class, log10_same_srv_rate) %>%
  na.omit() %>%
  ggplot(aes(x=log10_same_srv_rate, colour=class)) +
    geom_density(lwd=2)
p

p2 <- ggplot(data = data, aes(x = class, y = log10(same_srv_rate+0.5), color = class)) +
      geom_boxplot(outlier.shape = NA) +
      geom_jitter(alpha = 0.3, width = 0.15) +
      scale_color_manual(values = c("gray50", "orangered2")) +
      theme_bw()

p2

Analysis of diff_srv_rate. Relevant.

p <- data %>% mutate(log10_diff_srv_rate = log10(diff_srv_rate+0.5)) %>%
  select(class, log10_diff_srv_rate) %>%
  na.omit() %>%
  ggplot(aes(x=log10_diff_srv_rate, colour=class)) +
    geom_density(lwd=2)
p

p2 <- ggplot(data = data, aes(x = class, y = log10(diff_srv_rate+0.5), color = class)) +
      geom_boxplot(outlier.shape = NA) +
      geom_jitter(alpha = 0.3, width = 0.15) +
      scale_color_manual(values = c("gray50", "orangered2")) +
      theme_bw()

p2

Analysis of srv_diff_host_rate. Relevant.

p <- data %>% mutate(log10_srv_diff_host_rate = log10(srv_diff_host_rate+0.5)) %>%
  select(class, log10_srv_diff_host_rate) %>%
  na.omit() %>%
  ggplot(aes(x=log10_srv_diff_host_rate, colour=class)) +
    geom_density(lwd=2)
p

p2 <- ggplot(data = data, aes(x = class, y = log10(srv_diff_host_rate+0.5), color = class)) +
      geom_boxplot(outlier.shape = NA) +
      geom_jitter(alpha = 0.3, width = 0.15) +
      scale_color_manual(values = c("gray50", "orangered2")) +
      theme_bw()

p2

Analysis of dst_host_count. Relevant.

p <- data  %>%
  select(class, dst_host_count) %>%
  na.omit() %>%
  ggplot(aes(x=dst_host_count, colour=class)) +
    geom_density(lwd=2)
p

p2 <- ggplot(data = data, aes(x = class, y = dst_host_count, color = class)) +
      geom_boxplot(outlier.shape = NA) +
      geom_jitter(alpha = 0.3, width = 0.15) +
      scale_color_manual(values = c("gray50", "orangered2")) +
      theme_bw()

p2

Analysis of dst_host_srv_count. Relevant.

p <- data  %>%
  select(class, dst_host_srv_count) %>%
  na.omit() %>%
  ggplot(aes(x=dst_host_srv_count, colour=class)) +
    geom_density(lwd=2)
p

p2 <- ggplot(data = data, aes(x = class, y = dst_host_srv_count, color = class)) +
      geom_boxplot(outlier.shape = NA) +
      geom_jitter(alpha = 0.3, width = 0.15) +
      scale_color_manual(values = c("gray50", "orangered2")) +
      theme_bw()

p2

Analysis of dst_host_same_srv_rate. Relevant.

p <- data  %>%
  select(class, dst_host_same_srv_rate) %>%
  na.omit() %>%
  ggplot(aes(x=dst_host_same_srv_rate, colour=class)) +
    geom_density(lwd=2)
p

p2 <- ggplot(data = data, aes(x = class, y = dst_host_same_srv_rate, color = class)) +
      geom_boxplot(outlier.shape = NA) +
      geom_jitter(alpha = 0.3, width = 0.15) +
      scale_color_manual(values = c("gray50", "orangered2")) +
      theme_bw()

p2

Analysis of dst_host_diff_srv_rate. Relevant.

p <- data %>%
  select(class, dst_host_diff_srv_rate) %>%
  na.omit() %>%
  ggplot(aes(x=dst_host_diff_srv_rate, colour=class)) +
    geom_density(lwd=2)
p

p2 <- ggplot(data = data, aes(x = class, y = dst_host_diff_srv_rate, color = class)) +
      geom_boxplot(outlier.shape = NA) +
      geom_jitter(alpha = 0.3, width = 0.15) +
      scale_color_manual(values = c("gray50", "orangered2")) +
      theme_bw()

p2

Analysis of dst_host_same_src_port_rate. Could be relevant.

p <- data  %>%
  select(class, dst_host_same_src_port_rate) %>%
  na.omit() %>%
  ggplot(aes(x=dst_host_same_src_port_rate, colour=class)) +
    geom_density(lwd=2)
p

p2 <- ggplot(data = data, aes(x = class, y = dst_host_same_src_port_rate, color = class)) +
      geom_boxplot(outlier.shape = NA) +
      geom_jitter(alpha = 0.3, width = 0.15) +
      scale_color_manual(values = c("gray50", "orangered2")) +
      theme_bw()

p2

Analysis of dst_host_srv_diff_host_rate. Relevant.

p <- data %>%
  select(class, dst_host_srv_diff_host_rate) %>%
  na.omit() %>%
  ggplot(aes(x=dst_host_srv_diff_host_rate, colour=class)) +
    geom_density(lwd=2)
p

p2 <- ggplot(data = data, aes(x = class, y = dst_host_srv_diff_host_rate, color = class)) +
      geom_boxplot(outlier.shape = NA) +
      geom_jitter(alpha = 0.3, width = 0.15) +
      scale_color_manual(values = c("gray50", "orangered2")) +
      theme_bw()

p2

Analysis of dst_host_serror_rate. Relevant.

p <- data %>%
  select(class, dst_host_serror_rate) %>%
  na.omit() %>%
  ggplot(aes(x=dst_host_serror_rate, colour=class)) +
    geom_density(lwd=2)
p

p2 <- ggplot(data = data, aes(x = class, y = dst_host_serror_rate, color = class)) +
      geom_boxplot(outlier.shape = NA) +
      geom_jitter(alpha = 0.3, width = 0.15) +
      scale_color_manual(values = c("gray50", "orangered2")) +
      theme_bw()

p2

Analysis of dst_host_srv_serror_rate. Relevant.

p <- data %>%
  select(class, dst_host_srv_serror_rate) %>%
  na.omit() %>%
  ggplot(aes(x=dst_host_srv_serror_rate, colour=class)) +
    geom_density(lwd=2)
p

p2 <- ggplot(data = data, aes(x = class, y = dst_host_srv_serror_rate, color = class)) +
      geom_boxplot(outlier.shape = NA) +
      geom_jitter(alpha = 0.3, width = 0.15) +
      scale_color_manual(values = c("gray50", "orangered2")) +
      theme_bw()

p2

Analysis of dst_host_rerror_rate. Irrelevant.

p <- data  %>%
  select(class, dst_host_rerror_rate) %>%
  na.omit() %>%
  ggplot(aes(x=dst_host_rerror_rate, colour=class)) +
    geom_density(lwd=2)
p

p2 <- ggplot(data = data, aes(x = class, y = dst_host_rerror_rate, color = class)) +
      geom_boxplot(outlier.shape = NA) +
      geom_jitter(alpha = 0.3, width = 0.15) +
      scale_color_manual(values = c("gray50", "orangered2")) +
      theme_bw()

p2

Analysis of dst_host_srv_rerror_rate. Relevant.

p <- data %>%
  select(class, dst_host_srv_rerror_rate) %>%
  na.omit() %>%
  ggplot(aes(x=dst_host_srv_rerror_rate, colour=class)) +
    geom_density(lwd=2)
p

p2 <- ggplot(data = data, aes(x = class, y = dst_host_srv_rerror_rate, color = class)) +
      geom_boxplot(outlier.shape = NA) +
      geom_jitter(alpha = 0.3, width = 0.15) +
      scale_color_manual(values = c("gray50", "orangered2")) +
      theme_bw()

p2

4 Machine Learning Models

The models we use are a decision tree and two kNN classifiers, each kNN built on a different set of variables and evaluated with different values of k (1, 3 and 5).

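# First half of Train_data for training, second half for validation.
# Both ranges include row nrow(Train_data)/2, so the two sets share that one row.
data <- Train_data[1:(nrow(Train_data)/2), ]

data_test <-Train_data[(nrow(Train_data)/2):nrow(Train_data),]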
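The split above takes the rows in file order, and the ranges 1:(n/2) and (n/2):n both contain row n/2. One possible alternative, sketched here on a hypothetical toy data frame, is a random split by index, which keeps the two sets disjoint and does not depend on how the file is ordered. The seed value and the 50/50 proportion are arbitrary assumptions.

```r
# Hypothetical toy data standing in for Train_data.
toy <- data.frame(
  id    = 1:10,
  class = rep(c("normal", "anomaly"), times = 5)
)

set.seed(123)  # arbitrary seed, only for reproducibility
n <- nrow(toy)

# Draw half of the row indices for training; the rest form the validation set.
train_idx <- sample(seq_len(n), size = floor(n / 2))
toy_train <- toy[train_idx, ]
toy_valid <- toy[-train_idx, ]

cat("train rows:", nrow(toy_train), "validation rows:", nrow(toy_valid), "\n")
```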



datos_train_t=data%>%
  select(duration, protocol_type, service, flag, src_bytes, dst_bytes, wrong_fragment, logged_in, count, srv_count, serror_rate, srv_serror_rate, rerror_rate, srv_rerror_rate, same_srv_rate, diff_srv_rate, srv_diff_host_rate, dst_host_count, dst_host_srv_count,  dst_host_same_srv_rate, dst_host_diff_srv_rate, dst_host_same_src_port_rate, dst_host_srv_diff_host_rate, dst_host_serror_rate, dst_host_srv_serror_rate, dst_host_srv_rerror_rate)

datos_train_t$service=as.numeric(as.factor(datos_train_t$service))
datos_train_t$protocol_type=as.numeric(as.factor(datos_train_t$protocol_type))
datos_train_t$flag=as.numeric(as.factor(datos_train_t$flag))

datos_test_t=data_test%>%
  select(duration, protocol_type, service, flag, src_bytes, dst_bytes, wrong_fragment, logged_in, count, srv_count, serror_rate, srv_serror_rate, rerror_rate, srv_rerror_rate, same_srv_rate, diff_srv_rate, srv_diff_host_rate, dst_host_count, dst_host_srv_count,  dst_host_same_srv_rate, dst_host_diff_srv_rate, dst_host_same_src_port_rate, dst_host_srv_diff_host_rate, dst_host_serror_rate, dst_host_srv_serror_rate, dst_host_srv_rerror_rate)

datos_test_t$service=as.numeric(as.factor(datos_test_t$service))
datos_test_t$protocol_type=as.numeric(as.factor(datos_test_t$protocol_type))
datos_test_t$flag=as.numeric(as.factor(datos_test_t$flag))
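Note that as.numeric(as.factor(...)) assigns integer codes from the levels present in each data frame, so if the training and test halves do not contain exactly the same set of values, the same service could receive different codes in each. A possible safeguard, sketched here with hypothetical toy vectors, is to build the factor with a shared set of levels:

```r
# Hypothetical toy vectors: the test half contains a value ("telnet")
# that is absent from the training half, and lacks "ftp".
train_service <- c("http", "smtp", "ftp", "http")
test_service  <- c("smtp", "http", "telnet")

# Separate encoding: codes depend on the levels found in each vector.
sep_train <- as.numeric(as.factor(train_service))
sep_test  <- as.numeric(as.factor(test_service))

# Shared encoding: one level set built from both vectors.
shared_levels <- sort(unique(c(train_service, test_service)))
shared_train  <- as.numeric(factor(train_service, levels = shared_levels))
shared_test   <- as.numeric(factor(test_service,  levels = shared_levels))

cat("http, separate encoding:", sep_train[1], "vs", sep_test[2], "\n")
cat("http, shared encoding:  ", shared_train[1], "vs", shared_test[2], "\n")
```

Even with consistent codes, the integers impose an arbitrary ordering on the categories, which affects the distances kNN computes; this sketch only addresses the consistency issue.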

cl=factor(data$class)
k3_long=knn(cl=cl, train=datos_train_t, test=datos_train_t, k=3, prob=TRUE)
table(k3_long, cl)
##          cl
## k3_long   anomaly normal
##   anomaly    5864     46
##   normal       43   6643
cl_test_long=factor(data_test$class)
k3_long=knn(cl=cl,train=datos_train_t,test=datos_test_t,k=3,prob=TRUE)
table(k3_long,cl_test_long)
##          cl_test_long
## k3_long   anomaly normal
##   anomaly    5752     92
##   normal       84   6669
k1_long=knn(cl=cl,train=datos_train_t,test=datos_test_t,k=1,prob=TRUE)
k5_long=knn(cl=cl,train=datos_train_t,test=datos_test_t,k=5,prob=TRUE)
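The three values of k can also be compared in a single loop. The sketch below illustrates the idea on R's built-in iris data, used here only as a self-contained stand-in (it is not the intrusion dataset, and the 100/50 split and the seed are arbitrary assumptions). With the real data, train_x, valid_x and train_y would correspond to datos_train_t, datos_test_t and cl.

```r
library(class)

set.seed(1)  # arbitrary seed: fixes the split and knn's random tie-breaking

# Stand-in data: 100 rows for training, the remaining 50 for validation.
idx     <- sample(seq_len(nrow(iris)), 100)
train_x <- iris[idx, 1:4]
valid_x <- iris[-idx, 1:4]
train_y <- iris$Species[idx]
valid_y <- iris$Species[-idx]

# Validation error (share of misclassified rows) for each k.
ks <- c(1, 3, 5)
errors <- sapply(ks, function(k) {
  pred <- knn(train = train_x, test = valid_x, cl = train_y, k = k)
  mean(pred != valid_y)
})
names(errors) <- paste0("k=", ks)
print(errors)
```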
datos_train_t=data%>%
  select(protocol_type, service, flag, src_bytes, dst_bytes, wrong_fragment, logged_in, count, srv_count, serror_rate, srv_serror_rate, same_srv_rate, diff_srv_rate, srv_diff_host_rate, dst_host_count, dst_host_srv_count,  dst_host_same_srv_rate, dst_host_diff_srv_rate, dst_host_srv_diff_host_rate, dst_host_serror_rate, dst_host_srv_serror_rate)

datos_train_t$service=as.numeric(as.factor(datos_train_t$service))
datos_train_t$protocol_type=as.numeric(as.factor(datos_train_t$protocol_type))
datos_train_t$flag=as.numeric(as.factor(datos_train_t$flag))

datos_test_t=data_test%>%
  select(protocol_type, service, flag, src_bytes, dst_bytes, wrong_fragment, logged_in, count, srv_count, serror_rate, srv_serror_rate, same_srv_rate, diff_srv_rate, srv_diff_host_rate, dst_host_count, dst_host_srv_count,  dst_host_same_srv_rate, dst_host_diff_srv_rate, dst_host_srv_diff_host_rate, dst_host_serror_rate, dst_host_srv_serror_rate)

datos_test_t$service=as.numeric(as.factor(datos_test_t$service))
datos_test_t$protocol_type=as.numeric(as.factor(datos_test_t$protocol_type))
datos_test_t$flag=as.numeric(as.factor(datos_test_t$flag))

cl=factor(data$class)
k3_short=knn(cl=cl, train=datos_train_t, test=datos_train_t, k=3, prob=TRUE)
table(k3_short, cl)
##          cl
## k3_short  anomaly normal
##   anomaly    5870     40
##   normal       37   6649
cl_test_short=factor(data_test$class)
k3_short=knn(cl=cl,train=datos_train_t,test=datos_test_t,k=3,prob=TRUE)
table(k3_short,cl_test_short)
##          cl_test_short
## k3_short  anomaly normal
##   anomaly    5761     79
##   normal       75   6682
k1_short=knn(cl=cl,train=datos_train_t,test=datos_test_t,k=1,prob=TRUE)
k5_short=knn(cl=cl,train=datos_train_t,test=datos_test_t,k=5,prob=TRUE)
datos_train_t=data%>%
  select(duration, protocol_type, service, flag, src_bytes, dst_bytes, wrong_fragment, logged_in, count, srv_count, serror_rate, srv_serror_rate, rerror_rate, srv_rerror_rate, same_srv_rate, diff_srv_rate, srv_diff_host_rate, dst_host_count, dst_host_srv_count,  dst_host_same_srv_rate, dst_host_diff_srv_rate, dst_host_same_src_port_rate, dst_host_srv_diff_host_rate, dst_host_serror_rate, dst_host_srv_serror_rate, dst_host_srv_rerror_rate)


datos_test_t=data_test%>%
  select(duration, protocol_type, service, flag, src_bytes, dst_bytes, wrong_fragment, logged_in, count, srv_count, serror_rate, srv_serror_rate, rerror_rate, srv_rerror_rate, same_srv_rate, diff_srv_rate, srv_diff_host_rate, dst_host_count, dst_host_srv_count,  dst_host_same_srv_rate, dst_host_diff_srv_rate, dst_host_same_src_port_rate, dst_host_srv_diff_host_rate, dst_host_serror_rate, dst_host_srv_serror_rate, dst_host_srv_rerror_rate)


cl_train=factor(data$class)
datos_train_t=as.data.frame(datos_train_t)
datos_test_t=as.data.frame(datos_test_t)
dt1=rpart(cl_train~.,data=datos_train_t,control = rpart.control(cp = 0.01))
rpart.plot(dt1)

cl_test=factor(data_test$class)
table(predict(dt1,datos_train_t,type="class"),cl_train)
##          cl_train
##           anomaly normal
##   anomaly    5657     84
##   normal      250   6605
table(predict(dt1,datos_test_t,type="class"),cl_test)
##          cl_test
##           anomaly normal
##   anomaly    5562    105
##   normal      274   6656
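Comparing the error on the training half with the error on the held-out half gives a rough indication of overfitting. A minimal sketch using the two decision tree confusion matrices printed above, entered by hand (error_rate is a helper name introduced here for illustration):

```r
# Misclassification rate: 1 minus the share of counts on the diagonal.
error_rate <- function(cm) 1 - sum(diag(cm)) / sum(cm)

# Confusion matrices copied from the output above (rows: predicted, columns: actual).
labels <- c("anomaly", "normal")
tree_train <- matrix(c(5657, 250, 84, 6605), nrow = 2,
                     dimnames = list(predicted = labels, actual = labels))
tree_test  <- matrix(c(5562, 274, 105, 6656), nrow = 2,
                     dimnames = list(predicted = labels, actual = labels))

train_error <- error_rate(tree_train)
test_error  <- error_rate(tree_test)
cat("train error:", train_error, "\n")
cat("test error: ", test_error, "\n")
cat("gap:        ", test_error - train_error, "\n")
```

A small gap between the two errors suggests that the tree generalizes about as well as it fits; a large gap would point to overfitting.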

5 Results

The metrics of the kNN prediction models are shown below. The suffix long or short refers to the number of variables used, as listed in the previous section, and the number after the k is its value:

#k5_long
accuracy = sum(k5_long == cl_test_long) /length(cl_test_long)
error = 1-accuracy
cat("k=5 long error ->", error, "\n")
## k=5 long error -> 0.0143685
confusionMatrix(table(k5_long,data_test$class))
## Confusion Matrix and Statistics
## 
##          
## k5_long   anomaly normal
##   anomaly    5744     89
##   normal       92   6672
##                                           
##                Accuracy : 0.9856          
##                  95% CI : (0.9834, 0.9876)
##     No Information Rate : 0.5367          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9711          
##                                           
##  Mcnemar's Test P-Value : 0.8818          
##                                           
##             Sensitivity : 0.9842          
##             Specificity : 0.9868          
##          Pos Pred Value : 0.9847          
##          Neg Pred Value : 0.9864          
##              Prevalence : 0.4633          
##          Detection Rate : 0.4560          
##    Detection Prevalence : 0.4630          
##       Balanced Accuracy : 0.9855          
##                                           
##        'Positive' Class : anomaly         
## 
sensitivity = sum(k5_long == data_test$class & data_test$class == "anomaly") / sum(data_test$class == "anomaly")
recall = sensitivity
specificity =  sum(k5_long == data_test$class & data_test$class == "normal") / sum(data_test$class == "normal")
precision = sum(k5_long == data_test$class & k5_long == "anomaly") / sum(k5_long == "anomaly")
npv = sum(k5_long == data_test$class & k5_long == "normal") / sum(k5_long == "normal")
f1score = 2*precision*recall /(precision+recall)
cat("k=5 long f1score ->", f1score, "\n")
## k=5 long f1score -> 0.9844888
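The same metrics are recomputed for every model, so they could be wrapped in a small helper. The sketch below is one possible way to do this (binary_metrics is a hypothetical name, not part of caret), taking anomaly as the positive class and using the k=5 long confusion matrix shown above as input:

```r
# Metrics for a binary classifier from the four confusion matrix cells.
# tp/fn: actual positives predicted positive/negative;
# fp/tn: actual negatives predicted positive/negative.
binary_metrics <- function(tp, fp, fn, tn) {
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  c(
    accuracy    = (tp + tn) / (tp + fp + fn + tn),
    sensitivity = recall,
    specificity = tn / (tn + fp),
    precision   = precision,
    npv         = tn / (tn + fn),
    f1          = 2 * precision * recall / (precision + recall)
  )
}

# Cells taken from the k=5 long confusion matrix (positive class: anomaly).
m <- binary_metrics(tp = 5744, fp = 89, fn = 92, tn = 6672)
print(round(m, 4))
cat("error ->", 1 - unname(m["accuracy"]), "\n")
```

The resulting values can be checked against the confusionMatrix output and the error and f1score lines printed above.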
#k3_long
accuracy = sum(k3_long == cl_test_long) /length(cl_test_long)
error = 1-accuracy
cat("k=3 long error ->", error, "\n")
## k=3 long error -> 0.01397158
confusionMatrix(table(k3_long,data_test$class))
## Confusion Matrix and Statistics
## 
##          
## k3_long   anomaly normal
##   anomaly    5752     92
##   normal       84   6669
##                                          
##                Accuracy : 0.986          
##                  95% CI : (0.9838, 0.988)
##     No Information Rate : 0.5367         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.9719         
##                                          
##  Mcnemar's Test P-Value : 0.5977         
##                                          
##             Sensitivity : 0.9856         
##             Specificity : 0.9864         
##          Pos Pred Value : 0.9843         
##          Neg Pred Value : 0.9876         
##              Prevalence : 0.4633         
##          Detection Rate : 0.4566         
##    Detection Prevalence : 0.4639         
##       Balanced Accuracy : 0.9860         
##                                          
##        'Positive' Class : anomaly        
## 
sensitivity = sum(k3_long == data_test$class & data_test$class == "anomaly") / sum(data_test$class == "anomaly")
recall = sensitivity
specificity =  sum(k3_long == data_test$class & data_test$class == "normal") / sum(data_test$class == "normal")
precision = sum(k3_long == data_test$class & k3_long == "anomaly") / sum(k3_long == "anomaly")
npv = sum(k3_long == data_test$class & k3_long == "normal") / sum(k3_long == "normal")
f1score = 2*precision*recall /(precision+recall)
cat("k=3 long f1score ->", f1score, "\n")
## k=3 long f1score -> 0.9849315
#k1_long
accuracy = sum(k1_long == cl_test_long) /length(cl_test_long)
error = 1-accuracy
cat("k=1 long error ->", error, "\n")
## k=1 long error -> 0.008891006
confusionMatrix(table(k1_long,data_test$class))
## Confusion Matrix and Statistics
## 
##          
## k1_long   anomaly normal
##   anomaly    5781     57
##   normal       55   6704
##                                           
##                Accuracy : 0.9911          
##                  95% CI : (0.9893, 0.9927)
##     No Information Rate : 0.5367          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9821          
##                                           
##  Mcnemar's Test P-Value : 0.9247          
##                                           
##             Sensitivity : 0.9906          
##             Specificity : 0.9916          
##          Pos Pred Value : 0.9902          
##          Neg Pred Value : 0.9919          
##              Prevalence : 0.4633          
##          Detection Rate : 0.4589          
##    Detection Prevalence : 0.4634          
##       Balanced Accuracy : 0.9911          
##                                           
##        'Positive' Class : anomaly         
## 
sensitivity = sum(k1_long == data_test$class & data_test$class == "anomaly") / sum(data_test$class == "anomaly")
recall = sensitivity
specificity =  sum(k1_long == data_test$class & data_test$class == "normal") / sum(data_test$class == "normal")
precision = sum(k1_long == data_test$class & k1_long == "anomaly") / sum(k1_long == "anomaly")
npv = sum(k1_long == data_test$class & k1_long == "normal") / sum(k1_long == "normal")
f1score = 2*precision*recall /(precision+recall)
cat("k=1 long f1score ->", f1score, "\n")
## k=1 long f1score -> 0.990406
#k5_short
accuracy = sum(k5_short == cl_test_short) /length(cl_test_short)
error = 1-accuracy
cat("k=5 short error ->", error, "\n")
## k=5 short error -> 0.0128602
confusionMatrix(table(k5_short,data_test$class))
## Confusion Matrix and Statistics
## 
##          
## k5_short  anomaly normal
##   anomaly    5751     77
##   normal       85   6684
##                                         
##                Accuracy : 0.9871        
##                  95% CI : (0.985, 0.989)
##     No Information Rate : 0.5367        
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.9741        
##                                         
##  Mcnemar's Test P-Value : 0.5823        
##                                         
##             Sensitivity : 0.9854        
##             Specificity : 0.9886        
##          Pos Pred Value : 0.9868        
##          Neg Pred Value : 0.9874        
##              Prevalence : 0.4633        
##          Detection Rate : 0.4565        
##    Detection Prevalence : 0.4626        
##       Balanced Accuracy : 0.9870        
##                                         
##        'Positive' Class : anomaly       
## 
sensitivity = sum(k5_short == data_test$class & data_test$class == "anomaly") / sum(data_test$class == "anomaly")
recall = sensitivity
specificity =  sum(k5_short == data_test$class & data_test$class == "normal") / sum(data_test$class == "normal")
precision = sum(k5_short == data_test$class & k5_short == "anomaly") / sum(k5_short == "anomaly")
npv = sum(k5_short == data_test$class & k5_short == "normal") / sum(k5_short == "normal")
f1score = 2*precision*recall /(precision+recall)
cat("k=5 short f1score ->", f1score, "\n")
## k=5 short f1score -> 0.9861111
#k3_short
accuracy = sum(k3_short == cl_test_short) /length(cl_test_short)
error = 1-accuracy
cat("k=3 short error ->", error, "\n")
## k=3 short error -> 0.01222513
confusionMatrix(table(k3_short,data_test$class))
## Confusion Matrix and Statistics
## 
##          
## k3_short  anomaly normal
##   anomaly    5761     79
##   normal       75   6682
##                                           
##                Accuracy : 0.9878          
##                  95% CI : (0.9857, 0.9896)
##     No Information Rate : 0.5367          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9754          
##                                           
##  Mcnemar's Test P-Value : 0.809           
##                                           
##             Sensitivity : 0.9871          
##             Specificity : 0.9883          
##          Pos Pred Value : 0.9865          
##          Neg Pred Value : 0.9889          
##              Prevalence : 0.4633          
##          Detection Rate : 0.4573          
##    Detection Prevalence : 0.4636          
##       Balanced Accuracy : 0.9877          
##                                           
##        'Positive' Class : anomaly         
## 
sensitivity = sum(k3_short == data_test$class & data_test$class == "anomaly") / sum(data_test$class == "anomaly")
recall = sensitivity
specificity =  sum(k3_short == data_test$class & data_test$class == "normal") / sum(data_test$class == "normal")
precision = sum(k3_short == data_test$class & k3_short == "anomaly") / sum(k3_short == "anomaly")
npv = sum(k3_short == data_test$class & k3_short == "normal") / sum(k3_short == "normal")
f1score = 2*precision*recall /(precision+recall)
cat("k=3 short f1score ->", f1score, "\n")
## k=3 short f1score -> 0.9868106
#k1_short
accuracy = sum(k1_short == cl_test_short) /length(cl_test_short)
error = 1-accuracy
cat("k=1 short error ->", error, "\n")
## k=1 short error -> 0.00817655
confusionMatrix(table(k1_short,data_test$class))
## Confusion Matrix and Statistics
## 
##          
## k1_short  anomaly normal
##   anomaly    5786     53
##   normal       50   6708
##                                           
##                Accuracy : 0.9918          
##                  95% CI : (0.9901, 0.9933)
##     No Information Rate : 0.5367          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9836          
##                                           
##  Mcnemar's Test P-Value : 0.8438          
##                                           
##             Sensitivity : 0.9914          
##             Specificity : 0.9922          
##          Pos Pred Value : 0.9909          
##          Neg Pred Value : 0.9926          
##              Prevalence : 0.4633          
##          Detection Rate : 0.4593          
##    Detection Prevalence : 0.4635          
##       Balanced Accuracy : 0.9918          
##                                           
##        'Positive' Class : anomaly         
## 
sensitivity = sum(k1_short == data_test$class & data_test$class == "anomaly") / sum(data_test$class == "anomaly")
recall = sensitivity
specificity =  sum(k1_short == data_test$class & data_test$class == "normal") / sum(data_test$class == "normal")
precision = sum(k1_short == data_test$class & k1_short == "anomaly") / sum(k1_short == "anomaly")
npv = sum(k1_short == data_test$class & k1_short == "normal") / sum(k1_short == "normal")
f1score = 2*precision*recall /(precision+recall)
cat("k=1 short f1score ->", f1score, "\n")
## k=1 short f1score -> 0.9911777
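
The same six quantities are recomputed by hand for every model above. As a minimal sketch, they could be wrapped in a single helper. The function name `binary_metrics` and the reconstructed label vectors are assumptions made for illustration and are not part of the original analysis. The counts are taken from the k1_short confusion matrix shown above (rows are predictions, columns are the reference class).

```r
# Hypothetical helper (not in the original report): computes the metrics that
# are derived by hand above from predicted and true labels.
binary_metrics <- function(pred, truth, positive = "anomaly") {
  pred  <- as.character(pred)
  truth <- as.character(truth)
  tp <- sum(pred == positive & truth == positive)  # true positives
  fp <- sum(pred == positive & truth != positive)  # false positives
  fn <- sum(pred != positive & truth == positive)  # false negatives
  tn <- sum(pred != positive & truth != positive)  # true negatives
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  c(error       = (fp + fn) / length(truth),
    sensitivity = recall,
    specificity = tn / (tn + fp),
    precision   = precision,
    npv         = tn / (tn + fn),
    f1score     = 2 * precision * recall / (precision + recall))
}

# Rebuild label vectors from the k1_short confusion matrix:
# 5786 true positives, 50 false negatives, 53 false positives, 6708 true negatives.
truth <- c(rep("anomaly", 5786), rep("anomaly", 50), rep("normal", 53), rep("normal", 6708))
pred  <- c(rep("anomaly", 5786), rep("normal", 50),  rep("anomaly", 53), rep("normal", 6708))

m_k1_short <- binary_metrics(pred, truth)
print(round(m_k1_short, 4))
```

With the real prediction vectors, the call would look like `binary_metrics(k1_short, data_test$class)`, which keeps the metric definitions in one place instead of five.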

5.1 Explainability

The parameters used to build the kNN model determine the quality of its predictions. Judging by the computed error, in our case the kNN trained on fewer variables predicts better: for k=1 the test error drops from 0.0089 with the long variable set to 0.0082 with the short one. We can also see that, in our case, lowering the number of neighbours reduces the test error: with the short variable set it is 0.0129 for k=5, 0.0122 for k=3 and 0.0082 for k=1.
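
The comparison across k and across variable sets can be sketched as a small loop. This is only an illustration under stated assumptions: `knn_error_by_k` is a hypothetical helper, and the data is synthetic (two Gaussian classes), not the intrusion dataset. It relies only on `knn()` from the `class` package already loaded in the report.

```r
library(class)

# Illustrative helper (not from the original report): test error of kNN for
# several values of k on a given train/test split.
knn_error_by_k <- function(train, test, cl_train, cl_test, ks = c(1, 3, 5)) {
  sapply(ks, function(k) {
    pred <- knn(train = train, test = test, cl = cl_train, k = k)
    mean(as.character(pred) != as.character(cl_test))
  })
}

# Synthetic stand-in data (assumption): two Gaussian classes in 4 dimensions.
set.seed(1)
n <- 200
x <- rbind(matrix(rnorm(n * 4, mean = 0),   ncol = 4),
           matrix(rnorm(n * 4, mean = 1.5), ncol = 4))
y <- factor(rep(c("normal", "anomaly"), each = n))
idx <- sample(nrow(x), size = 0.7 * nrow(x))  # 70% of rows for training

ks <- c(1, 3, 5)
err_all   <- knn_error_by_k(x[idx, ],    x[-idx, ],    y[idx], y[-idx], ks)
err_fewer <- knn_error_by_k(x[idx, 1:2], x[-idx, 1:2], y[idx], y[-idx], ks)

# One row per k, one column per variable set.
errors <- data.frame(k = ks, all_vars = err_all, fewer_vars = err_fewer)
print(errors)
```

On synthetic data the ranking of k and of the variable sets need not match the one observed in this practice. The point of the sketch is only the tabular layout, which makes the two effects discussed above visible side by side.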

6 Evaluation

After comparing the results of all the algorithms tried during training, the one that seems to work best is a kNN with k=3 and the reduced variable set. Although k=1 gave better results on the test split, it showed overfitting, so it is not a suitable choice.
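
The concern about k=1 (overfitting) can be made concrete by comparing training error with test error for each k. With k=1 every training point is its own nearest neighbour, so the training error is zero as long as no duplicated points carry conflicting labels. It therefore says nothing about generalisation. The sketch below uses synthetic data and a hypothetical helper `train_test_gap`; neither comes from the original analysis.

```r
library(class)

# Illustrative helper (assumption): training and test error of kNN for one k.
train_test_gap <- function(train, test, cl_train, cl_test, k) {
  err <- function(newdata, truth) {
    pred <- knn(train = train, test = newdata, cl = cl_train, k = k)
    mean(as.character(pred) != as.character(truth))
  }
  c(k = k,
    train_error = err(train, cl_train),  # model evaluated on its own training data
    test_error  = err(test, cl_test))    # model evaluated on held-out data
}

# Synthetic stand-in data (assumption): two overlapping Gaussian classes.
set.seed(2)
n <- 150
x <- rbind(matrix(rnorm(n * 2),           ncol = 2),
           matrix(rnorm(n * 2, mean = 1), ncol = 2))
y <- factor(rep(c("normal", "anomaly"), each = n))
idx <- sample(nrow(x), 200)

# One row per k with columns k, train_error, test_error.
gaps <- t(sapply(c(1, 3, 5), function(k)
  train_test_gap(x[idx, ], x[-idx, ], y[idx], y[-idx], k)))
print(gaps)
```

A large difference between `train_error` and `test_error` for a given k is the usual symptom of overfitting, which is the criterion used here to discard k=1.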

# Reduced ("short") variable set used by the selected model.
vars_short <- c("protocol_type", "service", "flag", "src_bytes", "dst_bytes",
                "wrong_fragment", "logged_in", "count", "srv_count",
                "serror_rate", "srv_serror_rate", "same_srv_rate",
                "diff_srv_rate", "srv_diff_host_rate", "dst_host_count",
                "dst_host_srv_count", "dst_host_same_srv_rate",
                "dst_host_diff_srv_rate", "dst_host_srv_diff_host_rate",
                "dst_host_serror_rate", "dst_host_srv_serror_rate")

datos_train_t <- data %>% select(all_of(vars_short))
datos_test_t  <- Test_data %>% select(all_of(vars_short))

# Encode the three categorical variables as integer codes.
# Caveat: the codes are derived separately for each set, so they only agree
# if both sets contain exactly the same categories.
for (v in c("service", "protocol_type", "flag")) {
  datos_train_t[[v]] <- as.numeric(as.factor(datos_train_t[[v]]))
  datos_test_t[[v]]  <- as.numeric(as.factor(datos_test_t[[v]]))
}

# Final model: kNN with k = 3 trained on the whole training set.
# `cl` must hold the class labels of the rows in `data`.
result <- knn(cl = cl, train = datos_train_t, test = datos_test_t, k = 3, prob = TRUE)
table(result)
## result
## anomaly  normal 
##    8707   13837
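
One detail worth checking in the encoding step is that `as.numeric(as.factor(...))` is applied to the training and test sets independently. If a category is missing from one of them, the same string receives different integer codes in each set. A minimal sketch of a shared encoding follows. The helper `encode_like_train` and the toy values are hypothetical and are not taken from the dataset.

```r
# Hypothetical helper (not in the original report): encode a categorical
# column using the levels seen in the training data, so that a given category
# maps to the same integer in both sets.
encode_like_train <- function(train_col, test_col) {
  lv <- sort(unique(as.character(train_col)))
  list(train  = as.integer(factor(train_col, levels = lv)),
       test   = as.integer(factor(test_col,  levels = lv)),  # unseen categories become NA
       levels = lv)
}

# Toy example (made-up values) where the test set lacks the "ftp" category.
train_service <- c("ftp", "http", "smtp", "http")
test_service  <- c("http", "smtp", "http")

enc <- encode_like_train(train_service, test_service)
independent <- as.numeric(as.factor(test_service))

print(enc$test)     # 2 3 2: same codes as in the training set
print(independent)  # 1 2 1: "http" no longer matches its training code
```

Categories that appear only in the test set become `NA` under this scheme, and `knn()` does not accept missing values, so they would need explicit handling (for example, a dedicated "other" level) before prediction.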

7 Conclusions

In our case, decision trees seem to perform considerably worse than the kNN algorithm. In addition, when using kNN with k=1, there is clear overfitting. On the other hand, a fairly high value of k, such as 5, tends to make the results worse. The best option was therefore an intermediate value of k, in our case 3.